The Foundations of Multi-Armed Bandit Problems
AI029 Lesson 2
00:00

Welcome to the ultimate arena of decision-making under uncertainty. Imagine you are in a casino, facing a row of slot machines: the classic n-armed bandit problem. This is the fundamental nonassociative setting of reinforcement learning, where we strip away the complexity of changing environments to focus on one burning question: How do we choose the best action when we don't know the rules?

[Diagram: the agent-environment interaction loop. The agent sends action A_t to the environment; the environment returns reward R_{t+1} and state S_t, for t = 0, 1, 2, 3, ...]

The Interaction Framework

Reinforcement learning is a considerable abstraction of goal-directed learning from interaction. At each time step $t = 0, 1, 2, \dots$, the agent perceives a state $S_t \in \mathcal{S}$, selects an action $A_t \in \mathcal{A}(S_t)$, and receives a reward $R_{t+1} \in \mathcal{R}$. In the bandit problem there is effectively only one state, so the agent can focus entirely on learning a good action-selection policy through pure interaction.
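This loop can be sketched in a few lines of code. The setup below is a hypothetical 10-armed testbed (the arm count, reward distributions, and the uniform-random placeholder policy are illustrative assumptions, not prescribed by the lesson): each arm pays a Gaussian reward centered on a hidden true value, and since there is only one state, the environment step takes just the action.

```python
import random

random.seed(0)
k = 10
# Hidden true action values q*(a) -- the agent never sees these directly.
q_star = [random.gauss(0, 1) for _ in range(k)]

def pull(a):
    """Environment step: return reward R_{t+1} for action A_t = a (no state)."""
    return random.gauss(q_star[a], 1)

# The interaction loop from the text, with a placeholder random policy.
rewards = []
for t in range(1000):
    a = random.randrange(k)   # A_t: here chosen uniformly at random
    r = pull(a)               # R_{t+1}: evaluative feedback, just a score
    rewards.append(r)

print(sum(rewards) / len(rewards))
```

A uniform-random policy earns roughly the average of the true arm values; the rest of the lesson is about doing better than that.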

Paradigm              Feedback Type                      Learning Mechanism
Supervised Learning   Instructive (the "right" answer)   Pattern matching
Bandit Problems       Evaluative (a score)               Trial-and-error search

The Exploration-Exploitation Dilemma

Because the agent is never told which action is optimal, it faces a fundamental conflict. It must exploit what it already knows to secure immediate reward, but it must also explore less-tried actions that might yield even higher returns in the future. This tension distinguishes the bandit problem from static optimization and is the heartbeat of adaptive intelligence.
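One standard way to balance the two sides of this dilemma is epsilon-greedy action selection. The sketch below assumes a 10-armed Gaussian testbed and an epsilon of 0.1 (both illustrative choices): with probability epsilon the agent explores a random arm, otherwise it exploits the arm with the highest estimated value, updating estimates as incremental sample averages.

```python
import random

random.seed(1)
k, epsilon, steps = 10, 0.1, 5000
q_star = [random.gauss(0, 1) for _ in range(k)]   # hidden true action values
Q = [0.0] * k                                     # estimated values
N = [0] * k                                       # times each arm was pulled

for t in range(steps):
    if random.random() < epsilon:
        a = random.randrange(k)                   # explore: random arm
    else:
        a = max(range(k), key=lambda i: Q[i])     # exploit: current best estimate
    r = random.gauss(q_star[a], 1)                # evaluative feedback
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                     # incremental sample-average update

print("pull counts:", N)
```

After enough steps, the pull counts concentrate on the high-value arms while exploration keeps every estimate from going stale, which is exactly the trade-off the dilemma describes.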